We have an exciting role in Hyderabad with an AI SaaS fintech product firm.
SRE DevOps Lead Engineer (SaaS) || 8-12 Years || Hyderabad (Hybrid) || Quick Starter
Key Responsibilities:
Architect, design, and deploy end-to-end infrastructure solutions for a multi-tenant microservices-based SaaS application with a focus on AI/ML model integration.
Ensure system reliability, scalability, performance, and security, specifically enhancing AI/ML processing pipelines and workflows.
Use Terraform to provision on-demand environments in the AWS cloud, optimized for AI/ML workloads.
Implement and refine monitoring and alerting systems across application, network, and OS layers to support AI model operations and data processing.
Diagnose, support, and resolve production issues and alerts, participating in a 24/7 on-call rotation to maintain seamless AI/ML service operations.
Scope of Work:
Actively participate in the Scrum team, delivering test automation for sprint features and certifying new and regression features with automated test suites to ensure high-quality product increments.
Integrate automated tests into the CI/CD pipeline and schedule them to run periodically in product development environments.
Identify defects, collaborate with development engineers to resolve them, and verify the fixes.
Maintain continuous availability in alignment with startup culture, staying up to date with communications across various channels and email threads.
Focus on the primary goal of minimizing customer-reported bugs to near zero.
Required Qualifications:
8+ years of experience in Site Reliability Engineering (SRE) and DevOps roles with a track record of managing large-scale enterprise SaaS services in production, including 1+ year in AI/ML infrastructure.
Demonstrated expertise with AWS public cloud technologies, including extensive experience in deploying and managing large-scale container clusters using AWS EKS.
Skilled in Infrastructure as Code (IaC) using Terraform, and container technologies such as Docker and Kubernetes.
Proficient in scripting and programming for automation (Python, Bash, etc.), with strong Linux OS and networking fundamentals relevant to AI/ML workloads.
Experience in establishing monitoring systems to ensure high availability, performance, and security integrity, using tools like ELK Stack, CloudWatch, and others tailored for AI/ML monitoring.
Hands-on experience managing microservices architecture SaaS products, enabling RESTful web services, SSO integration (Okta, Auth0), and utilizing cloud databases like EC2-RDS, MySQL, and Elasticsearch, especially in AI/ML deployments.
Proficient in backup and disaster recovery strategies specific to AI/ML data resources like RDS and Elasticsearch.
AWS Certified Solutions Architect is strongly preferred.
Self-driven, proactive, and adaptable to thrive in an early-stage startup environment, with a keen interest in integrating AI/ML technologies into modern SaaS solutions.
We strictly prefer applicants with a stable career history (consistent employment) and a notice period of 0-30 days only.